## Self-supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance

This folder contains the code (in PyTorch) for the submission: Self-supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance. 

### Contents

- Requirements
- File Structure
- Key Implementations
- Data
- Usage
- References
- License

### Requirements

```python
# Python 3.8.8, CUDA version 11.6
torch==1.9.1
torch_cluster==1.5.1
torch_scatter==2.0.7
pyrender==0.1.45
trimesh==3.2.0
```
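
To check an existing environment against the pinned versions above, a minimal sketch along the following lines may help. It assumes the installed distribution names match the names in the list and ignores local build suffixes such as `+cu111`; the helper `find_mismatches` is hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
PINNED = {
    "torch": "1.9.1",
    "torch_cluster": "1.5.1",
    "torch_scatter": "2.0.7",
    "pyrender": "0.1.45",
    "trimesh": "3.2.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Assumption: a local build suffix ("+cu111") does not count as a mismatch.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Running the script prints one line per package whose installed version differs from the pin, and prints nothing when the environment matches.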

### File Structure

```shell
SPConvNets/ # main model files
vgtk/ # external dependencies
extensions/ # external dependencies
chamfer_distance/ # external dependencies
README.md # this file
run_unsup_ToySeg.py # entry script used by the training and evaluation commands
```

### Key Implementations

The code for the two main components of our method can be found as follows:

- **Part-level SE(3) Equivariant Network**: See the module `BasicSO3PoseConvBlock` in `SPConvNets/utils/base_so3poseconv.py`. Related code is in `vgtk/vgtk/so3conv`.
- **Shape, Structure, Pose Disentanglement Strategy**: See `SPConvNets/models/unsup_seg_so3_pose_conv_pn_35.py`.
  - **Kinematic Chain Prediction**: Search for the keyword `Kinematic Chain Prediction`; the relevant code sits between two comment lines that contain the keyword.
  - **Part Proposal**: See `SPConvNets/utils/slot_attention_spec_v2.py` for details.
  - **Canonical Part Shape Reconstruction, Joint Parameter Prediction, Part-assembling Parameters Prediction, Joint States Prediction, Base Part Rigid Transformation**: Search for the keyword `Canonical part shape reconstruction, Joint parameter prediction, Part-assembling Parameters Prediction. Joint States Prediction, Base Part Rigid Transformation`; the relevant code sits between two comment lines that contain the keyword.
  - **Shape Reconstruction**: Search for the keyword `Shape Reconstruction (Ref.)`; the code between the two comment lines that contain the keyword shows how we compose the predicted information into per-part, per-rotation shape reconstructions and then into the per-rotation object shape reconstruction.
  - **Joint Regularization**: Search for the keyword `Joint Regularization`; the relevant code sits between two comment lines that contain the keyword.
  - **Category-level Articulated Object Pose Evaluation Strategy**: See the function `eval` in `SPConvNets/trainer_unsup_seg.py` for details.
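
The keyword lookups above can also be scripted. The following is a minimal sketch, assuming the delimiting lines are `#`-style comments and that markers for one keyword come in open/close pairs; `find_keyword_blocks` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def find_keyword_blocks(source, keyword):
    """Return (start, end) 1-based line pairs for each region delimited by
    two consecutive comment lines that contain `keyword` (case-sensitive)."""
    marker_lines = [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        # Assumption: delimiters are full-line "#" comments.
        if line.lstrip().startswith("#") and keyword in line
    ]
    # Pair markers as (open, close), (open, close), ...; an unpaired tail is dropped.
    return list(zip(marker_lines[0::2], marker_lines[1::2]))


if __name__ == "__main__":
    path = Path("SPConvNets/models/unsup_seg_so3_pose_conv_pn_35.py")
    if path.exists():
        print(find_keyword_blocks(path.read_text(), "Joint Regularization"))
```

Run from the repository root, this prints the line ranges to open in an editor; the same call works for the other keywords listed above.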

### Data

We use seven categories from three public datasets: Oven, Washing Machine, Eyeglasses, and Laptop (S) from [Shape2Motion](https://github.com/wangxiaogang866/Shape2Motion); Drawer from [SAPIEN](https://sapien.ucsd.edu); and Safe and Laptop (R) from [HOI4D](https://hoi4d.github.io).

For each shape, we generate 100 posed shapes with different articulation states. 
   - For complete point clouds, we generate 10 samples per posed shape, each rotated by a random rotation matrix.
   - For partial point clouds, we render depth images of complete object instances using the rendering method described in [4], except that we manually set a viewpoint range for each category so that all parts are visible in the rendered depth images. For each posed shape, we render 10 depth images, matching the number of randomly rotated samples generated for complete point clouds.
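
As an illustration of the complete-point-cloud case, the sketch below produces 10 randomly rotated copies of one posed shape. The text does not say how the rotation matrices are sampled, so the QR-based sampler here is an assumption, and `random_rotation` and `rotated_samples` are hypothetical names rather than functions from this repository.

```python
import numpy as np


def random_rotation(rng):
    """Sample a 3x3 rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    # Resolve the sign ambiguity of QR so the orthogonal factor is well defined.
    q = q * np.sign(np.diag(r))
    # Flip one axis if needed so that det(q) = +1 (a proper rotation).
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def rotated_samples(points, n_samples=10, seed=0):
    """Return an (n_samples, N, 3) array of randomly rotated copies of `points`."""
    rng = np.random.default_rng(seed)
    # Points are stored as rows, so each rotation is applied as points @ R^T.
    return np.stack([points @ random_rotation(rng).T for _ in range(n_samples)])


if __name__ == "__main__":
    # Stand-in for one posed shape with 512 points.
    posed_shape = np.random.default_rng(1).standard_normal((512, 3))
    print(rotated_samples(posed_shape).shape)
```

Applied to each of the 100 posed shapes of an object, this yields 10 globally rotated samples per articulation state.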

### Usage

#### Building Dependencies

```shell
cd vgtk
python setup.py install
cd ..
```

#### Training Stage

Use the following command for training:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 TORCH_DISTRIBUTED_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=8 run_unsup_ToySeg.py experiment -d "./data" --init-lr=1e-4 --num-iters=2 --global-rot=1 --input-num=512 --bsz=1 --nmasks=${nmasks} --use-equi=${model_idx} --part-pred-npoints=256 --inv-attn=1 --orbit-attn=0 --slot-iters=7 --dataset-type=${dataset_type} --rot-factor=0.5 --init-radius=0.20 --equi-anchors=1  --translation=0 --rot-anchors=0 --no-articulation=0 --kanchor=60 --cent-trans=3 --shape-type=${category} --soft-attn=1 --feat-pooling=mean --factor=0.99 --queue-len=200 --glb-recon-factor=1.0 --slot-recon-factor=0.5 --use-sigmoid=1 --recon-prior=6 --lr-adjust=2 --n-dec-steps=1000 --use-flow-reg=0 --save-freq=200 --use-multi-sample=1 --n-samples=100 --exp-indicator=${exp_indicator} --kpconv-kanchor=60  --sel-mode=${sel_mode} --slot-single-mode=1 --cur-stage=1 --permute-modes=1 --use-2d=0 --pred-axis=1  --mtx-based-axis-regression=False --slot-single-cd=0 --glb-single-cd=0 --resume-path=${resume_path}
```

For the placeholders in the command: set `nmasks` to the number of parts of the category, `dataset_type` to one of `motion`, `hoi4d`, `motion_partial`, or `hoi4d_partial`, and `sel_mode` to `-1` or `29`.

For instance, to train on Oven with complete point clouds:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 TORCH_DISTRIBUTED_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=8 run_unsup_ToySeg.py experiment -d "./data" --init-lr=1e-4 --num-iters=2 --global-rot=1 --input-num=512 --bsz=1 --nmasks=2 --use-equi=38 --part-pred-npoints=256 --inv-attn=1 --orbit-attn=0 --slot-iters=7 --dataset-type='motion' --rot-factor=0.5 --init-radius=0.20 --equi-anchors=1  --translation=0 --rot-anchors=0 --no-articulation=0 --kanchor=60 --cent-trans=3 --shape-type='oven' --soft-attn=1 --feat-pooling=mean --factor=0.99 --queue-len=200 --glb-recon-factor=1.0 --slot-recon-factor=0.5 --use-sigmoid=1 --recon-prior=6 --lr-adjust=2 --n-dec-steps=1000 --use-flow-reg=0 --save-freq=200 --use-multi-sample=1 --n-samples=100 --exp-indicator='example_oven' --kpconv-kanchor=60  --sel-mode=-1 --slot-single-mode=1 --cur-stage=1 --permute-modes=1 --use-2d=0 --pred-axis=1  --mtx-based-axis-regression=False --slot-single-cd=0 --glb-single-cd=0 --resume-path='./ckpt/oven_example.pth'
```
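
When scripting several runs, the placeholder values of the training command can be validated before launching. The sketch below is one way to do that, assuming the option lists given above are exhaustive; `variable_flags` is a hypothetical helper, not part of this repository, and it only builds the placeholder-dependent flags.

```python
# Allowed values as listed for the training command placeholders.
DATASET_TYPES = ("motion", "hoi4d", "motion_partial", "hoi4d_partial")
SEL_MODES = (-1, 29)


def variable_flags(nmasks, model_idx, dataset_type, category,
                   exp_indicator, sel_mode, resume_path):
    """Build the placeholder-dependent flags of the training command."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"dataset_type must be one of {DATASET_TYPES}")
    if sel_mode not in SEL_MODES:
        raise ValueError(f"sel_mode must be one of {SEL_MODES}")
    if nmasks < 1:
        raise ValueError("nmasks must be a positive part count")
    return [
        f"--nmasks={nmasks}",
        f"--use-equi={model_idx}",
        f"--dataset-type={dataset_type}",
        f"--shape-type={category}",
        f"--exp-indicator={exp_indicator}",
        f"--sel-mode={sel_mode}",
        f"--resume-path={resume_path}",
    ]


if __name__ == "__main__":
    # Values taken from the Oven example above.
    print(" ".join(variable_flags(2, 38, "motion", "oven", "example_oven",
                                  -1, "./ckpt/oven_example.pth")))
```

The returned flags are meant to be combined with the fixed flags of the command; an invalid `dataset_type` or `sel_mode` raises a `ValueError` before any job is started.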

#### Evaluation Stage

Use the following command for evaluation:

```shell
CUDA_VISIBLE_DEVICES=7 TORCH_DISTRIBUTED_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=1 run_unsup_ToySeg.py experiment -d "./data" --init-lr=1e-4 --num-iters=2 --global-rot=1 --input-num=512 --bsz=1 --nmasks=${nmasks} --use-equi=${model_idx}  --part-pred-npoints=256 --inv-attn=1 --orbit-attn=0 --slot-iters=7 --dataset-type=${dataset_type} --rot-factor=0.5 --init-radius=0.20 --equi-anchors=1  --translation=0 --rot-anchors=0 --no-articulation=0 --kanchor=60 --cent-trans=3 --shape-type=${category} --soft-attn=1 --feat-pooling=mean --factor=0.99 --queue-len=200 --glb-recon-factor=1.0 --slot-recon-factor=0.5 --use-sigmoid=1 --recon-prior=6 --lr-adjust=2 --n-dec-steps=1000 --use-flow-reg=0 --save-freq=200 --use-multi-sample=1 --n-samples=100 --exp-indicator=${exp_indicator} --kpconv-kanchor=60  --sel-mode=${sel_mode}  --slot-single-mode=1 --cur-stage=1 --permute-modes=1 --use-2d=0 --pred-axis=1  --mtx-based-axis-regression=False --slot-single-cd=0 --glb-single-cd=0  --sel-mode-trans=${sel_mode_trans} --run-mode=eval --pre-compute-delta=1 --resume-path=${resume_path}
```

### References

[1] [EPN_PointCloud](https://github.com/nintendops/EPN_PointCloud)

[2] [equi-pose](https://github.com/dragonlong/equi-pose)

[3] [slot-attention](https://github.com/lucidrains/slot-attention)

[4] Li, X., Weng, Y., Yi, L., Guibas, L., Abbott, A. L., Song, S., & Wang, H. (2021). Leveraging SE(3) Equivariance for Self-Supervised Category-Level Object Pose Estimation. *arXiv preprint arXiv:2111.00190*.

### License

Our code and data are released under the MIT License (see the LICENSE file for details).
